[Cosmos] Session container fixes new branch #41678

Draft
wants to merge 71 commits into
base: main
from

Conversation

bambriz
Member

@bambriz bambriz commented Jun 20, 2025

Description

This PR replaces PR 40366, which was moved here due to pipeline issues. Please see the description of that PR for information on the session container fixes.

Original PR Description:

This PR initially aimed to close several gaps in the session token handling logic of the Python SDK's session container, specifically that the entire compound session token for a container was sent with every request. As a result of that work, it has grown in scope and now also addresses and closes the following issues:

Pending items

  • [Cosmos] make queries fetch query plan in every query #38577 - ensuring we send query plan calls for every cross-partition query. This was addressed here because cross-partition queries sent without query plans do not provide partition information that the session container can use. Such queries would be forced to either 1) be sent without a continuation token, basically running in Eventual consistency, or 2) be sent with the entire compound token for the container, which breaks users that have too many partitions because of the request header size limit. This is also the first follow-up item marked in the PPCB PR, and is needed to ensure we are fully covered on that front: Per Partition Circuit Breaker #40302. This item is pending because pagination logic seemed to break completely after splitting this up. Further work is needed to ensure our query pipeline can handle this scenario properly.

Current state

The Python SDK currently does several things that should be improved upon for session consistency behaviors:

  • We currently send a session token with every single request as long as the default account consistency is Session, which is undesired behavior for write operations in single-write region scenarios.
  • The session token that we send with our requests is a compound session token covering every partition in the container, which is infeasible for large accounts since the token can become large enough to cause request size issues.
  • The SDK has no partition key range cache refresh logic for partition split scenarios, since we don't receive 410/1002 status codes to react to for normal requests that send a partition key value rather than a partition key range id.
  • The SDK does not update session tokens after read requests, which allows stale reads for workloads where other clients are interacting with the same container resource.
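
To make the request size concern concrete, here is a rough sketch that builds a compound token for a given number of partitions and measures its length. The comma-separated `range_id:version#lsn` layout, the helper name `build_compound_token`, and the LSN value are illustrative assumptions, not taken from the SDK.

```python
def build_compound_token(partition_count: int, lsn: int = 1_000_000) -> str:
    """Build an illustrative compound session token.

    Assumes one `range_id:version#lsn` entry per partition key range,
    joined by commas. Real tokens may carry more fields per entry.
    """
    return ",".join(f"{range_id}:1#{lsn}" for range_id in range(partition_count))


small = build_compound_token(10)
large = build_compound_token(10_000)

# The header value grows with the number of partitions in the container.
print(f"10 partitions:     {len(small.encode('utf-8'))} bytes")
print(f"10,000 partitions: {len(large.encode('utf-8'))} bytes")
```

Under these assumptions the token length scales roughly linearly with the partition count, which is why sending the full compound token on every request does not work for large containers.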

Changes introduced

In order to address the above issues, the following changes have been made:

  • We will now only send out session tokens for the relevant requests under session consistency - read operations, batch operations, or requests sent by multi-write configured accounts.
  • The session token that gets sent out now contains only the information relevant to the request's partition, the minimum needed, the same way the .NET and Java SDKs do: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/SessionTokenHelper.java#L45
  • Now, once we receive a partition key range id in the response headers that is unaccounted for in the partition key range cache, we will force a refresh to the cache in order to obtain all the new ids to be used in session token computing for subsequent requests.
  • We now update session tokens on read requests as well, ensuring all requests are fetching the newest available session token.
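
The changes above can be sketched roughly as follows. This is a minimal illustration and not the SDK's implementation: the function names, the operation labels (`"read"`, `"batch"`, `"write"`), and the comma-separated `range_id:token` layout of the compound token are all assumptions made for the example.

```python
from typing import Dict, Optional, Set


def should_send_session_token(operation: str, multi_write: bool) -> bool:
    """Decide whether a request under Session consistency carries a token.

    Hypothetical rule mirroring the description: reads, batches, and any
    request from a multi-write account send a session token.
    """
    return multi_write or operation in {"read", "batch"}


def parse_compound_token(compound: str) -> Dict[str, str]:
    """Split an assumed `id:token,id:token` string into a per-range mapping."""
    tokens: Dict[str, str] = {}
    for part in compound.split(","):
        if not part:
            continue
        range_id, _, token = part.partition(":")
        tokens[range_id] = token
    return tokens


def partition_scoped_token(compound: str, range_id: str) -> Optional[str]:
    """Return only the entry for one partition key range, or None if unknown."""
    token = parse_compound_token(compound).get(range_id)
    return f"{range_id}:{token}" if token is not None else None


def needs_cache_refresh(known_range_ids: Set[str], response_range_id: str) -> bool:
    """A range id missing from the cache suggests a split, so refresh."""
    return response_range_id not in known_range_ids


compound = "0:1#100,1:1#250,2:1#75"
print(partition_scoped_token(compound, "1"))
print(should_send_session_token("write", multi_write=False))
print(needs_cache_refresh({"0", "1"}, "2"))
```

Scoping the token to a single partition key range keeps the header size constant regardless of how many partitions the container has.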

Caveats

We currently only initialize the session container within a client if the user properly initializes their client. While this is not a problem for the sync client, it means that users who do not initialize their asynchronous clients as outlined in our README will not be able to leverage the session container, and will have to implement their own session token handling logic to achieve session consistency.

@bambriz
Member Author

bambriz commented Jun 20, 2025

/azp run python - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

github-actions bot commented Jun 20, 2025

API Change Check

APIView identified API level changes in this PR and created the following API reviews

azure-cosmos

reduces timeout from 25 minutes to 7 minutes
@bambriz
Member Author

bambriz commented Jun 21, 2025

/azp run python - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

@bambriz
Member Author

bambriz commented Jun 23, 2025

/azp run python - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

@bambriz
Member Author

bambriz commented Jun 27, 2025

/azp run python - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

Fixes tests to accommodate the extra readfeed that may happen during a read item operation. This also includes other general test fixes.
@bambriz
Member Author

bambriz commented Jun 30, 2025

/azp run python - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

@bambriz
Member Author

bambriz commented Jul 1, 2025

/azp run python - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

2 participants